This data set contains case counts and infection rates for sexually transmitted diseases (chlamydia, gonorrhea, and early syphilis, which covers primary, secondary, and early latent syphilis) reported for California residents. The data are broken down by disease, county, year, and sex.
The data was collected for cases with estimated diagnosis dates spanning from 2001 up to the most recent year available. It was sourced from California Confidential Morbidity Reports and Laboratory Reports, all of which were submitted to the California Department of Public Health (CDPH) by July of the current year. These reports adhered to the surveillance case definition for each respective disease.
After exploring the data, we settled on one main question: which STD has the highest prevalence in California, and how is this disease geographically distributed across the state? We then extended the analysis to identify the year with the highest STD rates and to compare infection rates by sex.
The STD data was retrieved from “https://data.chhs.ca.gov/dataset/stds-in-california-by-disease-county-year-and-sex”.
The geographical data was retrieved from “https://public.opendatasoft.com/explore/dataset/us-county-boundaries/export/?disjunctive.statefp&disjunctive.countyfp&disjunctive.name&disjunctive.namelsad&disjunctive.stusab&disjunctive.state_name&refine.stusab=CA”.
Merging the STD and geographic data sets: the STD data did not include latitude and longitude coordinates, so we introduced a second data set with county boundaries to support the geographic analysis. As a first step, we merged the main STD data set with this geographic data set.
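The merge can be illustrated with a small sketch. The report's analysis was done in R; the Python below is only an illustration using the standard library. The row values, the `lat`/`lon` fields, and the choice of a left join on the county name are assumptions, not details taken from the report.

```python
# Hypothetical STD rows: a county name but no coordinates.
std_rows = [
    {"County": "Alameda", "Year": 2019, "Sex": "Total", "Disease": "Chlamydia"},
    {"County": "Lake", "Year": 2019, "Sex": "Total", "Disease": "Chlamydia"},
    {"County": "California", "Year": 2019, "Sex": "Total", "Disease": "Chlamydia"},
]

# Hypothetical boundary rows: one per county, with made-up coordinates.
geo_rows = [
    {"name": "Alameda", "lat": 37.65, "lon": -121.91},
    {"name": "Lake", "lat": 39.10, "lon": -122.75},
]


def left_join(left, right, left_key, right_key):
    """Attach the right-hand fields to every left-hand row, matching on a key.

    Left rows without a match are kept and get None for the added fields.
    """
    lookup = {row[right_key]: row for row in right}
    extra_fields = [k for k in right[0] if k != right_key]
    merged = []
    for row in left:
        match = lookup.get(row[left_key])
        extra = {k: (match[k] if match else None) for k in extra_fields}
        merged.append({**row, **extra})
    return merged


merged = left_join(std_rows, geo_rows, "County", "name")
for row in merged:
    print(row["County"], row["lat"], row["lon"])
```

In this sketch the statewide "California" row has no matching county boundary, so it keeps empty coordinates rather than being dropped.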
The combined data set has 11 columns. The “Cases” and “Rate” columns contain missing values wherever the “Annotation Code” variable marks a value as withheld from publication. We removed the rows with these missing values.
The “Rate” column was stored as chr (character), so we converted it to a numeric type.
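The two cleaning steps above (dropping rows with missing values, then converting “Rate” from text to a number) might look like the following Python sketch. The report itself used R; the sample rows, the use of empty strings for missing values, and the annotation code value are hypothetical.

```python
# Hypothetical rows as they might look right after reading a CSV:
# every field is text, and withheld values are empty strings.
rows = [
    {"County": "Alameda", "Cases": "7215", "Rate": "432.1", "Annotation Code": ""},
    {"County": "Alpine", "Cases": "", "Rate": "", "Annotation Code": "1"},
    {"County": "Lake", "Cases": "310", "Rate": "481.5", "Annotation Code": ""},
]


def clean(rows):
    """Drop rows with a missing Cases or Rate, then convert both to numbers."""
    cleaned = []
    for row in rows:
        if row["Cases"] in ("", None) or row["Rate"] in ("", None):
            continue  # value withheld or missing: skip the row
        cleaned.append({**row, "Cases": int(row["Cases"]), "Rate": float(row["Rate"])})
    return cleaned


cleaned = clean(rows)
for row in cleaned:
    print(row["County"], row["Cases"], row["Rate"])
```

Converting after filtering avoids having to decide what number an empty string should become.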
The “County” column includes rows labeled “California”, which is the state rather than a county, so we removed them from the county-level data. We saved these statewide aggregate rows in a separate variable, “Cali”.
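Separating the statewide rows from the county rows is a simple filter. The Python sketch below uses hypothetical rows and borrows the name `cali` from the variable mentioned above; the report's own code was written in R.

```python
# Hypothetical cleaned rows, including one statewide aggregate row.
rows = [
    {"County": "Alameda", "Year": 2019, "Cases": 7215},
    {"County": "California", "Year": 2019, "Cases": 200000},
    {"County": "Lake", "Year": 2019, "Cases": 310},
]

# Keep the statewide aggregate in its own list, as the report does with "Cali".
cali = [row for row in rows if row["County"] == "California"]

# Everything else is county-level data.
counties = [row for row in rows if row["County"] != "California"]

print(len(cali), "statewide row(s),", len(counties), "county row(s)")
```

Keeping the aggregate rows, instead of discarding them, makes statewide trends available without re-summing the counties.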
The analysis used the following R libraries: data.table, tidyverse, dplyr, plotly, DT, and knitr.
| Year | Sex | Count of Diseases | Cases Avg | Cases SD | Rate Avg | Rate SD |
|---|---|---|---|---|---|---|
| 2001 | Female | 3 | 28844 | 41110.26 | 166.36667 | 237.10256 |
| 2001 | Male | 3 | 12791 | 12039.87 | 74.50000 | 70.10193 |
| 2001 | Total | 3 | 41944 | 52847.99 | 121.56667 | 153.13407 |
| 2002 | Female | 3 | 30868 | 44249.78 | 175.83333 | 252.04256 |
| 2002 | Male | 3 | 14604 | 13451.54 | 84.03333 | 77.40390 |
| 2002 | Total | 3 | 45743 | 57455.33 | 130.90000 | 164.46115 |
| 2003 | Female | 3 | 32366 | 46081.94 | 181.96667 | 259.13036 |
| 2003 | Male | 3 | 15597 | 14607.35 | 88.60000 | 82.95330 |
| 2003 | Total | 3 | 48067 | 60327.61 | 135.83333 | 170.48444 |
| 2004 | Female | 3 | 34463 | 48094.33 | 191.80000 | 267.70022 |
| 2004 | Male | 3 | 17449 | 15802.51 | 98.10000 | 88.84329 |
| 2004 | Total | 3 | 52057 | 63409.59 | 145.63333 | 177.37061 |
| 2005 | Female | 3 | 36121 | 49313.19 | 199.70000 | 272.60435 |
| 2005 | Male | 3 | 18986 | 16880.86 | 106.06667 | 94.31905 |
| 2005 | Total | 3 | 55352 | 65824.54 | 153.83333 | 182.93885 |
| 2006 | Female | 3 | 37927 | 52367.21 | 208.13333 | 287.41702 |
| 2006 | Male | 3 | 19636 | 17662.97 | 108.93333 | 98.00573 |
| 2006 | Total | 3 | 57842 | 69800.70 | 159.56667 | 192.55101 |
| 2007 | Female | 3 | 38905 | 55006.27 | 211.76667 | 299.39459 |
| 2007 | Male | 3 | 20132 | 18946.50 | 110.76667 | 104.23657 |
| 2007 | Total | 3 | 59252 | 73844.18 | 162.10000 | 202.06019 |
| 2008 | Female | 3 | 38775 | 57240.62 | 209.33333 | 308.97620 |
| 2008 | Male | 3 | 20524 | 21116.05 | 111.96667 | 115.16550 |
| 2008 | Total | 3 | 59523 | 78460.49 | 161.53333 | 212.88817 |
| 2009 | Female | 3 | 37368 | 56184.72 | 200.56667 | 301.55877 |
| 2009 | Male | 3 | 20881 | 21624.72 | 113.20000 | 117.26291 |
| 2009 | Total | 3 | 58451 | 77871.67 | 157.66667 | 210.00991 |
| 2010 | Female | 3 | 39123 | 58485.69 | 208.30000 | 311.38964 |
| 2010 | Male | 3 | 22656 | 23108.14 | 121.93333 | 124.32990 |
| 2010 | Total | 3 | 62018 | 81629.34 | 165.96667 | 218.44346 |
| 2011 | Female | 3 | 41396 | 62336.84 | 218.56667 | 329.17718 |
| 2011 | Male | 3 | 23917 | 24238.04 | 127.46667 | 129.13173 |
| 2011 | Total | 3 | 65509 | 86568.94 | 173.73333 | 229.59082 |
| 2012 | Female | 3 | 42963 | 63126.24 | 224.83333 | 330.33084 |
| 2012 | Male | 3 | 26461 | 24785.36 | 139.66667 | 130.83273 |
| 2012 | Total | 3 | 69561 | 87651.10 | 182.80000 | 230.34715 |
| 2013 | Female | 3 | 42493 | 61236.09 | 220.80000 | 318.22305 |
| 2013 | Male | 3 | 28291 | 24720.36 | 148.10000 | 129.42523 |
| 2013 | Total | 3 | 70880 | 85474.69 | 184.90000 | 222.92422 |
| 2014 | Female | 3 | 43508 | 61524.78 | 224.46667 | 317.43203 |
| 2014 | Male | 3 | 31852 | 26810.47 | 165.33333 | 139.20206 |
| 2014 | Total | 3 | 75474 | 87593.33 | 195.30000 | 226.64997 |
| 2015 | Female | 3 | 47038 | 65282.62 | 241.10000 | 334.60704 |
| 2015 | Male | 3 | 37263 | 29645.73 | 192.06667 | 152.76784 |
| 2015 | Total | 3 | 84435 | 93888.92 | 217.00000 | 241.29758 |
| 2016 | Female | 3 | 48776 | 65825.26 | 248.73333 | 335.61897 |
| 2016 | Male | 3 | 42253 | 31958.22 | 216.46667 | 163.76515 |
| 2016 | Total | 3 | 91342 | 96335.92 | 233.43333 | 246.18985 |
| 2017 | Female | 3 | 53863 | 71320.95 | 273.33333 | 361.91430 |
| 2017 | Male | 3 | 48562 | 35759.58 | 247.46667 | 182.20319 |
| 2017 | Total | 3 | 102643 | 105316.87 | 260.96667 | 267.78516 |
| 2018 | Female | 3 | 57215 | 74913.26 | 289.36667 | 378.84868 |
| 2018 | Male | 3 | 51592 | 38244.01 | 261.83333 | 194.10565 |
| 2018 | Total | 3 | 109081 | 111573.22 | 276.33333 | 282.63286 |
| 2019 | Female | 3 | 58149 | 75295.94 | 293.70000 | 380.32811 |
| 2019 | Male | 3 | 53140 | 39700.64 | 269.33333 | 201.20331 |
| 2019 | Total | 3 | 111531 | 113640.05 | 282.13333 | 287.51773 |
| 2020 | Female | 3 | 46510 | 55128.15 | 234.76667 | 278.32101 |
| 2020 | Male | 3 | 43287 | 28498.95 | 219.40000 | 144.41731 |
| 2020 | Total | 3 | 90065 | 81740.81 | 227.76667 | 206.69713 |
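The year-by-sex summary above reports, for each group, the number of diseases plus the mean and standard deviation of cases and rates. A minimal Python sketch of that kind of aggregation follows. The input numbers are made up, and the use of the sample standard deviation (n − 1 denominator) is an assumption about how the table was produced.

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical statewide rows: one per disease within each (Year, Sex) group.
rows = [
    {"Year": 2019, "Sex": "Female", "Disease": "Chlamydia", "Cases": 100, "Rate": 50.0},
    {"Year": 2019, "Sex": "Female", "Disease": "Gonorrhea", "Cases": 40, "Rate": 20.0},
    {"Year": 2019, "Sex": "Female", "Disease": "Early Syphilis", "Cases": 10, "Rate": 5.0},
    {"Year": 2019, "Sex": "Male", "Disease": "Chlamydia", "Cases": 60, "Rate": 30.0},
    {"Year": 2019, "Sex": "Male", "Disease": "Gonorrhea", "Cases": 30, "Rate": 15.0},
    {"Year": 2019, "Sex": "Male", "Disease": "Early Syphilis", "Cases": 30, "Rate": 15.0},
]


def summarize(rows):
    """Group rows by (Year, Sex) and compute count, mean, and sample SD."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row["Year"], row["Sex"])].append(row)

    summary = {}
    for key in sorted(groups):
        group = groups[key]
        cases = [r["Cases"] for r in group]
        rates = [r["Rate"] for r in group]
        summary[key] = {
            "n_diseases": len({r["Disease"] for r in group}),
            "cases_avg": mean(cases),
            "cases_sd": stdev(cases),  # sample SD, n - 1 denominator
            "rate_avg": mean(rates),
            "rate_sd": stdev(rates),
        }
    return summary


summary = summarize(rows)
for (year, sex), stats in summary.items():
    print(year, sex, stats)
```

Because each group averages over only three diseases with very different case counts, a standard deviation larger than the mean, as seen in the table, is not surprising.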
| Sex | Cases | Population | Rate |
|---|---|---|---|
| Female | 2292 | 407916 | 0.5618804 |
| Male | 1032 | 409452 | 0.2520442 |
| Total | 3324 | 817368 | 0.4066712 |
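The Rate column in the table above is consistent with cases divided by population, multiplied by 100 (a percentage). That reading is inferred from the figures and is not stated in the report. The short Python check below reproduces the rates from the Cases and Population columns and compares the female and male rates.

```python
# Cases and population copied from the table above.
table = {
    "Female": (2292, 407916),
    "Male": (1032, 409452),
    "Total": (3324, 817368),
}


def rate_percent(cases, population):
    """Cases per 100 residents (assumed definition of the Rate column)."""
    return cases / population * 100


rates = {sex: rate_percent(cases, pop) for sex, (cases, pop) in table.items()}

# How many times higher the female rate is than the male rate.
ratio = rates["Female"] / rates["Male"]

for sex, rate in rates.items():
    print(f"{sex}: {rate:.7f}")
print(f"Female-to-male rate ratio: {ratio:.2f}")
```

The ratio comes out a little above 2, which matches the "twice as many" comparison between females and males.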
Chlamydia was the most prevalent STD in California throughout 2001 to 2020. Statewide infection rates peaked in 2019, and Lake County was the most heavily affected county.
A geographic pattern was apparent: the Central Valley reported the highest infection rates, and rates gradually decreased toward the Nevada border. There was also a notable difference by sex in Lake County in 2019, where females reported about twice as many infections as males, which highlights the importance of tailored interventions and awareness initiatives.